Healthcare Insurance Analysis¶

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
In [2]:
hospital = pd.read_csv("Hospitalisation details.csv")
medical = pd.read_csv("Medical Examinations.csv")
customer = pd.read_excel("Names.xlsx")
In [3]:
hospital.head()
Out[3]:
Customer ID year month date children charges Hospital tier City tier State ID
0 Id2335 1992 Jul 9 0 563.84 tier - 2 tier - 3 R1013
1 Id2334 1992 Nov 30 0 570.62 tier - 2 tier - 1 R1013
2 Id2333 1993 Jun 30 0 600.00 tier - 2 tier - 1 R1013
3 Id2332 1992 Sep 13 0 604.54 tier - 3 tier - 3 R1013
4 Id2331 1998 Jul 27 0 637.26 tier - 3 tier - 3 R1013
In [4]:
hospital.shape
Out[4]:
(2343, 9)
In [5]:
hospital.describe()
Out[5]:
date children charges
count 2343.000000 2343.000000 2343.000000
mean 15.554844 1.026035 13559.067870
std 8.721194 1.233847 11922.658415
min 1.000000 0.000000 563.840000
25% 8.000000 0.000000 5084.010000
50% 15.000000 0.000000 9634.540000
75% 23.000000 2.000000 17029.675000
max 30.000000 5.000000 63770.430000
In [6]:
hospital.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2343 entries, 0 to 2342
Data columns (total 9 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Customer ID    2343 non-null   object 
 1   year           2343 non-null   object 
 2   month          2343 non-null   object 
 3   date           2343 non-null   int64  
 4   children       2343 non-null   int64  
 5   charges        2343 non-null   float64
 6   Hospital tier  2343 non-null   object 
 7   City tier      2343 non-null   object 
 8   State ID       2343 non-null   object 
dtypes: float64(1), int64(2), object(6)
memory usage: 164.9+ KB
In [7]:
medical.head()
Out[7]:
Customer ID BMI HBA1C Heart Issues Any Transplants Cancer history NumberOfMajorSurgeries smoker
0 Id1 47.410 7.47 No No No No major surgery yes
1 Id2 30.360 5.77 No No No No major surgery yes
2 Id3 34.485 11.87 yes No No 2 yes
3 Id4 38.095 6.05 No No No No major surgery yes
4 Id5 35.530 5.45 No No No No major surgery yes
In [8]:
medical.shape
Out[8]:
(2335, 8)
In [9]:
medical.describe()
Out[9]:
BMI HBA1C
count 2335.000000 2335.000000
mean 30.972649 6.578998
std 8.742095 2.228731
min 15.010000 4.000000
25% 24.600000 4.900000
50% 30.400000 5.810000
75% 36.300000 7.955000
max 55.050000 12.000000
In [10]:
medical.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2335 entries, 0 to 2334
Data columns (total 8 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   Customer ID             2335 non-null   object 
 1   BMI                     2335 non-null   float64
 2   HBA1C                   2335 non-null   float64
 3   Heart Issues            2335 non-null   object 
 4   Any Transplants         2335 non-null   object 
 5   Cancer history          2335 non-null   object 
 6   NumberOfMajorSurgeries  2335 non-null   object 
 7   smoker                  2335 non-null   object 
dtypes: float64(2), object(6)
memory usage: 146.1+ KB
In [11]:
customer.head()
Out[11]:
Customer ID name
0 Id1 Hawks, Ms. Kelly
1 Id2 Lehner, Mr. Matthew D
2 Id3 Lu, Mr. Phil
3 Id4 Osborne, Ms. Kelsey
4 Id5 Kadala, Ms. Kristyn
In [12]:
customer.shape
Out[12]:
(2335, 2)
In [13]:
customer.describe()
Out[13]:
Customer ID name
count 2335 2335
unique 2335 2335
top Id1 Hawks, Ms. Kelly
freq 1 1
In [14]:
customer.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2335 entries, 0 to 2334
Data columns (total 2 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   Customer ID  2335 non-null   object
 1   name         2335 non-null   object
dtypes: object(2)
memory usage: 36.6+ KB

Collate the files so that all the information is in one place¶

In [15]:
df = pd.merge(pd.merge(hospital,medical,on='Customer ID'),customer,on='Customer ID')
In [16]:
df
Out[16]:
Customer ID year month date children charges Hospital tier City tier State ID BMI HBA1C Heart Issues Any Transplants Cancer history NumberOfMajorSurgeries smoker name
0 Id2335 1992 Jul 9 0 563.84 tier - 2 tier - 3 R1013 17.580 4.51 No No No 1 No German, Mr. Aaron K
1 Id2334 1992 Nov 30 0 570.62 tier - 2 tier - 1 R1013 17.600 4.39 No No No 1 No Rosendahl, Mr. Evan P
2 Id2333 1993 Jun 30 0 600.00 tier - 2 tier - 1 R1013 16.470 6.35 No No Yes 1 No Albano, Ms. Julie
3 Id2332 1992 Sep 13 0 604.54 tier - 3 tier - 3 R1013 17.700 6.28 No No No 1 No Riveros Gonzalez, Mr. Juan D. Sr.
4 Id2331 1998 Jul 27 0 637.26 tier - 3 tier - 3 R1013 22.340 5.57 No No No 1 No Brietzke, Mr. Jordan
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
2330 Id5 1989 Jun 19 0 55135.40 tier - 1 tier - 2 R1012 35.530 5.45 No No No No major surgery yes Kadala, Ms. Kristyn
2331 Id4 1991 Jun 6 1 58571.07 tier - 1 tier - 3 R1024 38.095 6.05 No No No No major surgery yes Osborne, Ms. Kelsey
2332 Id3 1970 ? 11 3 60021.40 tier - 1 tier - 1 R1012 34.485 11.87 yes No No 2 yes Lu, Mr. Phil
2333 Id2 1977 Jun 8 0 62592.87 tier - 2 tier - 3 R1013 30.360 5.77 No No No No major surgery yes Lehner, Mr. Matthew D
2334 Id1 1968 Oct 12 0 63770.43 tier - 1 tier - 3 R1013 47.410 7.47 No No No No major surgery yes Hawks, Ms. Kelly

2335 rows × 17 columns

In [17]:
df.shape
Out[17]:
(2335, 17)

Check for missing values in the dataset¶

In [18]:
df.isna().sum().sum()
Out[18]:
0
In [ ]:
 

Find the percentage of rows that have trivial value (for example, ?), and delete such rows if they do not contain significant information¶

In [20]:
trivial = df[df.eq("?").any(axis=1)]
In [21]:
trivial.shape
Out[21]:
(10, 17)
In [22]:
round(trivial.shape[0]/df.shape[0]*100, 2)
Out[22]:
0.43
In [24]:
df.drop(df[df.eq("?").any(axis=1)].index, axis=0, inplace=True)
In [25]:
df.shape
Out[25]:
(2325, 17)

Use the necessary transformation methods to deal with the nominal and ordinal categorical variables in the dataset¶

In [26]:
from sklearn.preprocessing import LabelEncoder
In [27]:
l = LabelEncoder()
In [28]:
df['smoker'] = l.fit_transform(df['smoker'])
In [29]:
df['Any Transplants'] = l.fit_transform(df['Any Transplants'])
In [30]:
df['Heart Issues'] = l.fit_transform(df['Heart Issues'])
In [31]:
df['Cancer history'] = l.fit_transform(df['Cancer history'])
In [32]:
df['Cancer history'].value_counts()
Out[32]:
Cancer history
0    1934
1     391
Name: count, dtype: int64
In [33]:
df['Heart Issues'].value_counts()
Out[33]:
Heart Issues
0    1405
1     920
Name: count, dtype: int64
In [34]:
df['Any Transplants'].value_counts()
Out[34]:
Any Transplants
0    2183
1     142
Name: count, dtype: int64
In [35]:
df['smoker'].value_counts()
Out[35]:
smoker
0    1839
1     486
Name: count, dtype: int64
In [36]:
def fun(val):
    return int(val.replace("tier", "").replace(" ", "").replace("-", ""))
In [37]:
df['Hospital tier'] = df['Hospital tier'].map(fun)
In [38]:
df['City tier'] = df['City tier'].map(fun)
In [39]:
df
Out[39]:
Customer ID year month date children charges Hospital tier City tier State ID BMI HBA1C Heart Issues Any Transplants Cancer history NumberOfMajorSurgeries smoker name
0 Id2335 1992 Jul 9 0 563.84 2 3 R1013 17.580 4.51 0 0 0 1 0 German, Mr. Aaron K
1 Id2334 1992 Nov 30 0 570.62 2 1 R1013 17.600 4.39 0 0 0 1 0 Rosendahl, Mr. Evan P
2 Id2333 1993 Jun 30 0 600.00 2 1 R1013 16.470 6.35 0 0 1 1 0 Albano, Ms. Julie
3 Id2332 1992 Sep 13 0 604.54 3 3 R1013 17.700 6.28 0 0 0 1 0 Riveros Gonzalez, Mr. Juan D. Sr.
4 Id2331 1998 Jul 27 0 637.26 3 3 R1013 22.340 5.57 0 0 0 1 0 Brietzke, Mr. Jordan
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
2329 Id6 1962 Aug 4 0 52590.83 1 3 R1011 32.800 6.59 0 0 0 No major surgery 1 Baker, Mr. Russell B.
2330 Id5 1989 Jun 19 0 55135.40 1 2 R1012 35.530 5.45 0 0 0 No major surgery 1 Kadala, Ms. Kristyn
2331 Id4 1991 Jun 6 1 58571.07 1 3 R1024 38.095 6.05 0 0 0 No major surgery 1 Osborne, Ms. Kelsey
2333 Id2 1977 Jun 8 0 62592.87 2 3 R1013 30.360 5.77 0 0 0 No major surgery 1 Lehner, Mr. Matthew D
2334 Id1 1968 Oct 12 0 63770.43 1 3 R1013 47.410 7.47 0 0 0 No major surgery 1 Hawks, Ms. Kelly

2325 rows × 17 columns

The dataset has State ID, which has around 16 states. All states are not represented in equal proportions in the data. Creating dummy variables for all regions may also result in too many insignificant predictors. Nevertheless, only R1011, R1012, and R1013 are worth investigating further. Create a suitable strategy to create dummy variables with these restraints.¶

In [40]:
df['State ID'].value_counts()
Out[40]:
State ID
R1013    609
R1011    574
R1012    572
R1024    159
R1026     84
R1021     70
R1016     64
R1025     40
R1023     38
R1017     36
R1019     26
R1022     14
R1014     13
R1015     11
R1018      9
R1020      6
Name: count, dtype: int64
In [41]:
Dummies = pd.get_dummies(df["State ID"], prefix= "State_ID")
Dummies
Out[41]:
State_ID_R1011 State_ID_R1012 State_ID_R1013 State_ID_R1014 State_ID_R1015 State_ID_R1016 State_ID_R1017 State_ID_R1018 State_ID_R1019 State_ID_R1020 State_ID_R1021 State_ID_R1022 State_ID_R1023 State_ID_R1024 State_ID_R1025 State_ID_R1026
0 False False True False False False False False False False False False False False False False
1 False False True False False False False False False False False False False False False False
2 False False True False False False False False False False False False False False False False
3 False False True False False False False False False False False False False False False False
4 False False True False False False False False False False False False False False False False
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
2329 True False False False False False False False False False False False False False False False
2330 False True False False False False False False False False False False False False False False
2331 False False False False False False False False False False False False False True False False
2333 False False True False False False False False False False False False False False False False
2334 False False True False False False False False False False False False False False False False

2325 rows × 16 columns

In [42]:
Dummy = Dummies[['State_ID_R1011','State_ID_R1012', 'State_ID_R1013']]
Dummy
Out[42]:
State_ID_R1011 State_ID_R1012 State_ID_R1013
0 False False True
1 False False True
2 False False True
3 False False True
4 False False True
... ... ... ...
2329 True False False
2330 False True False
2331 False False False
2333 False False True
2334 False False True

2325 rows × 3 columns

In [43]:
df = pd.concat([df,Dummy],axis=1)
In [44]:
df
Out[44]:
Customer ID year month date children charges Hospital tier City tier State ID BMI HBA1C Heart Issues Any Transplants Cancer history NumberOfMajorSurgeries smoker name State_ID_R1011 State_ID_R1012 State_ID_R1013
0 Id2335 1992 Jul 9 0 563.84 2 3 R1013 17.580 4.51 0 0 0 1 0 German, Mr. Aaron K False False True
1 Id2334 1992 Nov 30 0 570.62 2 1 R1013 17.600 4.39 0 0 0 1 0 Rosendahl, Mr. Evan P False False True
2 Id2333 1993 Jun 30 0 600.00 2 1 R1013 16.470 6.35 0 0 1 1 0 Albano, Ms. Julie False False True
3 Id2332 1992 Sep 13 0 604.54 3 3 R1013 17.700 6.28 0 0 0 1 0 Riveros Gonzalez, Mr. Juan D. Sr. False False True
4 Id2331 1998 Jul 27 0 637.26 3 3 R1013 22.340 5.57 0 0 0 1 0 Brietzke, Mr. Jordan False False True
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
2329 Id6 1962 Aug 4 0 52590.83 1 3 R1011 32.800 6.59 0 0 0 No major surgery 1 Baker, Mr. Russell B. True False False
2330 Id5 1989 Jun 19 0 55135.40 1 2 R1012 35.530 5.45 0 0 0 No major surgery 1 Kadala, Ms. Kristyn False True False
2331 Id4 1991 Jun 6 1 58571.07 1 3 R1024 38.095 6.05 0 0 0 No major surgery 1 Osborne, Ms. Kelsey False False False
2333 Id2 1977 Jun 8 0 62592.87 2 3 R1013 30.360 5.77 0 0 0 No major surgery 1 Lehner, Mr. Matthew D False False True
2334 Id1 1968 Oct 12 0 63770.43 1 3 R1013 47.410 7.47 0 0 0 No major surgery 1 Hawks, Ms. Kelly False False True

2325 rows × 20 columns

The variable NumberOfMajorSurgeries also appears to have string values. Apply a suitable method to clean up this variable.¶

In [45]:
df['NumberOfMajorSurgeries'] = df['NumberOfMajorSurgeries'].replace('No major surgery',0)
In [46]:
df['NumberOfMajorSurgeries'].value_counts()
Out[46]:
NumberOfMajorSurgeries
0    1070
1     961
2     272
3      22
Name: count, dtype: int64

Age appears to be a significant factor in this analysis. Calculate the patients' ages based on their dates of birth.¶

In [47]:
df['year'] = pd.to_datetime(df['year'], format='%Y').dt.year
df['year']
Out[47]:
0       1992
1       1992
2       1993
3       1992
4       1998
        ... 
2329    1962
2330    1989
2331    1991
2333    1977
2334    1968
Name: year, Length: 2325, dtype: int32
In [48]:
df['month'] = pd.to_datetime(df['month'], format='%b').dt.month
df['month']
Out[48]:
0        7
1       11
2        6
3        9
4        7
        ..
2329     8
2330     6
2331     6
2333     6
2334    10
Name: month, Length: 2325, dtype: int32
In [49]:
df['DateInt'] = df['year'].astype(str) + df['month'].astype(str).str.zfill(2) + df['date'].astype(str).str.zfill(2)
In [50]:
df['DOB'] = pd.to_datetime(df.DateInt, format='%Y%m%d')
In [51]:
df.drop(['DateInt'], inplace=True, axis=1)
In [52]:
import datetime as dt
current_date = dt.datetime.now()
In [53]:
df['Age'] = (((current_date-df.DOB).dt.days)/365).astype(int)
In [54]:
df
Out[54]:
Customer ID year month date children charges Hospital tier City tier State ID BMI ... Any Transplants Cancer history NumberOfMajorSurgeries smoker name State_ID_R1011 State_ID_R1012 State_ID_R1013 DOB Age
0 Id2335 1992 7 9 0 563.84 2 3 R1013 17.580 ... 0 0 1 0 German, Mr. Aaron K False False True 1992-07-09 31
1 Id2334 1992 11 30 0 570.62 2 1 R1013 17.600 ... 0 0 1 0 Rosendahl, Mr. Evan P False False True 1992-11-30 31
2 Id2333 1993 6 30 0 600.00 2 1 R1013 16.470 ... 0 1 1 0 Albano, Ms. Julie False False True 1993-06-30 30
3 Id2332 1992 9 13 0 604.54 3 3 R1013 17.700 ... 0 0 1 0 Riveros Gonzalez, Mr. Juan D. Sr. False False True 1992-09-13 31
4 Id2331 1998 7 27 0 637.26 3 3 R1013 22.340 ... 0 0 1 0 Brietzke, Mr. Jordan False False True 1998-07-27 25
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
2329 Id6 1962 8 4 0 52590.83 1 3 R1011 32.800 ... 0 0 0 1 Baker, Mr. Russell B. True False False 1962-08-04 61
2330 Id5 1989 6 19 0 55135.40 1 2 R1012 35.530 ... 0 0 0 1 Kadala, Ms. Kristyn False True False 1989-06-19 34
2331 Id4 1991 6 6 1 58571.07 1 3 R1024 38.095 ... 0 0 0 1 Osborne, Ms. Kelsey False False False 1991-06-06 32
2333 Id2 1977 6 8 0 62592.87 2 3 R1013 30.360 ... 0 0 0 1 Lehner, Mr. Matthew D False False True 1977-06-08 46
2334 Id1 1968 10 12 0 63770.43 1 3 R1013 47.410 ... 0 0 0 1 Hawks, Ms. Kelly False False True 1968-10-12 55

2325 rows × 22 columns

The gender of the patient may be an important factor in determining the cost of hospitalization. The salutations in a beneficiary's name can be used to determine their gender. Make a new field for the beneficiary's gender.¶

In [55]:
def gen(x):
    if 'Ms.' in x:
        return 0
    else:
        return 1
In [56]:
df['Gender'] = df['name'].map(gen)
In [57]:
df['Gender']
Out[57]:
0       1
1       1
2       0
3       1
4       1
       ..
2329    1
2330    0
2331    0
2333    1
2334    0
Name: Gender, Length: 2325, dtype: int64

You should also visualize the distribution of costs using a histogram, box and whisker plot, and swarm plot.¶

In [58]:
# Histogram 
sns.histplot(df['charges'])
Out[58]:
<Axes: xlabel='charges', ylabel='Count'>
In [59]:
# box and whisker plot
sns.boxplot(df['charges'])
Out[59]:
<Axes: ylabel='charges'>
In [61]:
import warnings
warnings.filterwarnings("ignore")
In [62]:
# Swarm Plot
plt.figure(figsize=(30,10))
sns.swarmplot(x='year', y='charges', hue="Gender", data=df)
Out[62]:
<Axes: xlabel='year', ylabel='charges'>

State how the distribution is different across gender and tiers of hospitals¶

In [63]:
sns.countplot(data = df,x = 'Hospital tier', hue = 'Gender')
Out[63]:
<Axes: xlabel='Hospital tier', ylabel='count'>

Create a radar chart to showcase the median hospitalization cost for each tier of hospitals¶

In [64]:
df[df['Hospital tier']==1].charges.median()
Out[64]:
32097.434999999998
In [65]:
df[df['Hospital tier']==2].charges.median()
Out[65]:
7168.76
In [66]:
df[df['Hospital tier']==3].charges.median()
Out[66]:
10676.83
In [67]:
import plotly.express as px

df1 = pd.DataFrame(dict(
    r=[32097.434999999998,7168.76,10676.83],
    theta=['Tier 1','Tier 2','Tier 3']))
fig = px.line_polar(df1, r='r', theta='theta', line_close=True)
fig.show()

Create a frequency table and a stacked bar chart to visualize the count of people in the different tiers of cities and hospitals¶

In [68]:
city_freq = df["City tier"].value_counts().rename_axis('City&hospital_tier').reset_index(name='city_counts')
In [69]:
hospital_freq = df["Hospital tier"].value_counts().rename_axis('City&hospital_tier').reset_index(name='hospital_counts')
In [70]:
freq_table = pd.merge(city_freq, hospital_freq, on = 'City&hospital_tier')
In [71]:
freq_table
Out[71]:
City&hospital_tier city_counts hospital_counts
0 2 807 1334
1 3 789 691
2 1 729 300
In [72]:
x = freq_table['City&hospital_tier']
y1 = freq_table['city_counts']
y2 = freq_table['hospital_counts']
 
# plot bars in stack manner
plt.bar(x, y1, color='r')
plt.bar(x, y2, bottom=y1, color='b')
plt.show()

Test the following null hypotheses:¶

The average hospitalization costs for the three types of hospitals are not significantly different¶

In [73]:
from scipy.stats import ttest_1samp
In [74]:
from scipy.stats import friedmanchisquare
data1 = [32097.43]
data2 = [7168.76]
data3 = [10676.83]
stat, p = friedmanchisquare(data1, data2, data3)
print('stat=%.3f, p=%.3f' % (stat, p))
if p > 0.05:
    print('Probably the same distribution')
else:
    print('Probably different distributions')
stat=2.000, p=0.368
Probably the same distribution

The average hospitalization costs for the three types of cities are not significantly different¶

In [75]:
print("median cost of tier 1 city:", df[df["City tier"]==1].charges.median())
print("median cost of tier 2 city:", df[df["City tier"]==2].charges.median())
print("median cost of tier 3 city:", df[df["City tier"]==3].charges.median())
median cost of tier 1 city: 10027.15
median cost of tier 2 city: 8968.33
median cost of tier 3 city: 9880.07
In [76]:
data1 = [10027.15]
data2 = [8968.33]
data3 = [9880.07]
stat, p = friedmanchisquare(data1, data2, data3)
print('stat=%.3f, p=%.3f' % (stat, p))
if p > 0.05:
    print('Probably the same distribution')
else:
    print('Probably different distributions')
stat=2.000, p=0.368
Probably the same distribution

The average hospitalization cost for smokers is not significantly different from the average cost for nonsmokers¶

In [77]:
print("median cost of smoker:", df[df["smoker"]==1].charges.median())
print("median cost of non smoker:", df[df["smoker"]==0].charges.median())
median cost of smoker: 34125.475
median cost of non smoker: 7537.16
In [78]:
from scipy.stats import kruskal
data1 = [34125.475]
data2 = [7537.16]
stat, p = kruskal(data1, data2)
print('stat=%.3f, p=%.3f' % (stat, p))
if p > 0.05:
    print('Probably the same distribution')
else:
    print('Probably different distributions')
stat=1.000, p=0.317
Probably the same distribution

Smoking and heart issues are independent¶

In [79]:
from scipy.stats import chi2_contingency
table = [[df["Heart Issues"].value_counts()],[df["smoker"].value_counts()]]
stat, p, dof, expected = chi2_contingency(table)
print('stat=%.3f, p=%.3f' % (stat, p))
if p > 0.05:
    print('Probably independent')
else:
    print('Probably dependent')
stat=191.145, p=0.000
Probably dependent

Machine Learning¶

Examine the correlation between predictors to identify highly correlated predictors. Use a heatmap to visualize this.¶

In [80]:
df.drop(["Customer ID","State ID",'name', 'year', 'month', 'date', 'DOB'], inplace=True, axis=1)
In [81]:
df.head()
Out[81]:
children charges Hospital tier City tier BMI HBA1C Heart Issues Any Transplants Cancer history NumberOfMajorSurgeries smoker State_ID_R1011 State_ID_R1012 State_ID_R1013 Age Gender
0 0 563.84 2 3 17.58 4.51 0 0 0 1 0 False False True 31 1
1 0 570.62 2 1 17.60 4.39 0 0 0 1 0 False False True 31 1
2 0 600.00 2 1 16.47 6.35 0 0 1 1 0 False False True 30 0
3 0 604.54 3 3 17.70 6.28 0 0 0 1 0 False False True 31 1
4 0 637.26 3 3 22.34 5.57 0 0 0 1 0 False False True 25 1

Dropped those columns because they are not usable to model building

In [82]:
df.shape
Out[82]:
(2325, 16)
In [83]:
df.head()
Out[83]:
children charges Hospital tier City tier BMI HBA1C Heart Issues Any Transplants Cancer history NumberOfMajorSurgeries smoker State_ID_R1011 State_ID_R1012 State_ID_R1013 Age Gender
0 0 563.84 2 3 17.58 4.51 0 0 0 1 0 False False True 31 1
1 0 570.62 2 1 17.60 4.39 0 0 0 1 0 False False True 31 1
2 0 600.00 2 1 16.47 6.35 0 0 1 1 0 False False True 30 0
3 0 604.54 3 3 17.70 6.28 0 0 0 1 0 False False True 31 1
4 0 637.26 3 3 22.34 5.57 0 0 0 1 0 False False True 25 1
In [84]:
correlation = df.corr()
In [85]:
correlation
Out[85]:
children charges Hospital tier City tier BMI HBA1C Heart Issues Any Transplants Cancer history NumberOfMajorSurgeries smoker State_ID_R1011 State_ID_R1012 State_ID_R1013 Age Gender
children 1.000000 0.055901 -0.052438 -0.015760 -0.005339 -0.101379 0.023984 -0.142040 -0.027880 -0.113161 0.017713 0.011666 0.005247 -0.013834 -0.005457 0.011205
charges 0.055901 1.000000 -0.446687 0.035300 0.346730 0.139697 0.049299 -0.127028 -0.022522 0.053308 0.838462 0.286956 -0.074636 -0.150634 0.304395 0.034069
Hospital tier -0.052438 -0.446687 1.000000 -0.039755 -0.104771 0.057855 0.053376 0.011729 -0.021429 0.033230 -0.474077 -0.114685 0.020272 0.002455 0.133771 0.041261
City tier -0.015760 0.035300 -0.039755 1.000000 0.038123 -0.005404 0.023152 0.002970 -0.018639 0.027937 0.032034 0.036049 -0.018253 0.002766 -0.008070 0.054073
BMI -0.005339 0.346730 -0.104771 0.038123 1.000000 -0.006920 0.017129 0.015893 -0.020235 0.018851 0.107126 0.115671 0.017939 -0.208744 0.049260 0.079930
HBA1C -0.101379 0.139697 0.057855 -0.005404 -0.006920 1.000000 0.007699 -0.159855 -0.170921 -0.091594 0.007257 0.015525 -0.019513 0.033453 0.460558 -0.027339
Heart Issues 0.023984 0.049299 0.053376 0.023152 0.017129 0.007699 1.000000 -0.140269 0.111190 0.206147 -0.007159 0.005852 0.021770 -0.027967 0.192273 0.010277
Any Transplants -0.142040 -0.127028 0.011729 0.002970 0.015893 -0.159855 -0.140269 1.000000 -0.114677 0.158593 -0.025101 -0.058553 -0.066453 0.064563 -0.381084 -0.012737
Cancer history -0.027880 -0.022522 -0.021429 -0.018639 -0.020235 -0.170921 0.111190 -0.114677 1.000000 0.204208 0.006415 0.011919 0.058222 -0.066475 -0.101073 0.009359
NumberOfMajorSurgeries -0.113161 0.053308 0.033230 0.027937 0.018851 -0.091594 0.206147 0.158593 0.204208 1.000000 0.017199 0.000208 -0.002098 -0.002056 0.151442 -0.003349
smoker 0.017713 0.838462 -0.474077 0.032034 0.107126 0.007257 -0.007159 -0.025101 0.006415 0.017199 1.000000 0.336112 -0.106998 -0.094547 0.011939 0.020968
State_ID_R1011 0.011666 0.286956 -0.114685 0.036049 0.115671 0.015525 0.005852 -0.058553 0.011919 0.000208 0.336112 1.000000 -0.327054 -0.341085 0.008022 -0.075234
State_ID_R1012 0.005247 -0.074636 0.020272 -0.018253 0.017939 -0.019513 0.021770 -0.066453 0.058222 -0.002098 -0.106998 -0.327054 1.000000 -0.340296 -0.005229 -0.077093
State_ID_R1013 -0.013834 -0.150634 0.002455 0.002766 -0.208744 0.033453 -0.027967 0.064563 -0.066475 -0.002056 -0.094547 -0.341085 -0.340296 1.000000 -0.011926 -0.029639
Age -0.005457 0.304395 0.133771 -0.008070 0.049260 0.460558 0.192273 -0.381084 -0.101073 0.151442 0.011939 0.008022 -0.005229 -0.011926 1.000000 -0.007350
Gender 0.011205 0.034069 0.041261 0.054073 0.079930 -0.027339 0.010277 -0.012737 0.009359 -0.003349 0.020968 -0.075234 -0.077093 -0.029639 -0.007350 1.000000
In [86]:
plt.figure(figsize=(15,10))
sns.heatmap(correlation, annot=True, linewidth=.5, cmap="crest")
plt.show()

2. Develop and evaluate the final model using regression with a stochastic gradient descent optimizer. Also, ensure that you apply all the following suggestions:¶

• Perform the stratified 5-fold cross-validation technique for model building and validation • Use standardization and hyperparameter tuning effectively • Use sklearn-pipelines • Use appropriate regularization techniques to address the bias-variance trade-off

a. Create five folds in the data, and introduce a variable to identify the folds¶

b. For each fold, run a for loop and ensure that 80 percent of the data is used to train the model and the remaining 20 percent is used to validate it in each iteration¶

c. Develop five distinct models and five distinct validation scores (root mean squared error values)¶

d. Determine the variable importance scores, and identify the redundant variables¶

In [87]:
# lets first seperate the input and output data.
x = df.drop(["charges"], axis=1)
y = df[['charges']]
In [88]:
from sklearn.model_selection import train_test_split
In [89]:
x_train, x_test, y_train, y_test  = train_test_split(x,y, test_size=.20, random_state=10)
In [90]:
from sklearn.preprocessing import StandardScaler
In [91]:
sc = StandardScaler()
In [92]:
x_train = sc.fit_transform(x_train)
x_test = sc.fit_transform(x_test)
In [93]:
# Stochastic gradient descent optimizer
from sklearn.linear_model import SGDRegressor
from sklearn.model_selection import GridSearchCV
In [94]:
params = {'alpha': [0.0001, 0.001, 0.01, 0.05, 0.1, 0.2,0.3,0.4,0.5,
                   0.6,0.7,0.8,0.9,1.0,2.0,3.0,4.0,5.0,6.0,7.0,8.0,
                   9.0,10.0,20,50,100,500,1000],
         'penalty': ['l2', 'l1', 'elasticnet']}

sgd = SGDRegressor()

# Cross Validation 
folds = 5
model_cv = GridSearchCV(estimator = sgd,
                       param_grid = params,
                       scoring = 'neg_mean_absolute_error',
                       cv = folds,
                       return_train_score = True,
                       verbose = 1)
model_cv.fit(x_train,y_train)
Fitting 5 folds for each of 84 candidates, totalling 420 fits
Out[94]:
GridSearchCV(cv=5, estimator=SGDRegressor(),
             param_grid={'alpha': [0.0001, 0.001, 0.01, 0.05, 0.1, 0.2, 0.3,
                                   0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0, 2.0, 3.0,
                                   4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0, 20, 50,
                                   100, 500, 1000],
                         'penalty': ['l2', 'l1', 'elasticnet']},
             return_train_score=True, scoring='neg_mean_absolute_error',
             verbose=1)
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
GridSearchCV(cv=5, estimator=SGDRegressor(),
             param_grid={'alpha': [0.0001, 0.001, 0.01, 0.05, 0.1, 0.2, 0.3,
                                   0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0, 2.0, 3.0,
                                   4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0, 20, 50,
                                   100, 500, 1000],
                         'penalty': ['l2', 'l1', 'elasticnet']},
             return_train_score=True, scoring='neg_mean_absolute_error',
             verbose=1)
SGDRegressor()
SGDRegressor()
In [95]:
model_cv.best_params_
Out[95]:
{'alpha': 0.7, 'penalty': 'l1'}
In [96]:
sgd = SGDRegressor(alpha= 100, penalty= 'l1')
In [97]:
sgd.fit(x_train, y_train)
Out[97]:
SGDRegressor(alpha=100, penalty='l1')
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
SGDRegressor(alpha=100, penalty='l1')
In [98]:
sgd.score(x_test, y_test)
Out[98]:
0.870999188185268
In [99]:
y_pred = sgd.predict(x_test)
In [100]:
from sklearn.metrics import mean_squared_error, mean_absolute_error
In [101]:
sgd_mae = mean_absolute_error(y_test, y_pred)
sgd_mse = mean_squared_error(y_test, y_pred)
sgd_rmse = sgd_mse*(1/2.0)
In [102]:
print("MAE:", sgd_mae)
print("MSE:", sgd_mse)
print("RMSE:", sgd_rmse)
MAE: 2651.366890864599
MSE: 18775743.771674618
RMSE: 9387871.885837309
In [103]:
importance = sgd.coef_
In [104]:
pd.DataFrame(importance, index = x.columns, columns=['Inportance Score'])
Out[104]:
Inportance Score
children 344.381045
Hospital tier -1035.362293
City tier 0.000000
BMI 2666.030914
HBA1C 121.528657
Heart Issues 0.000000
Any Transplants 0.000000
Cancer history 0.000000
NumberOfMajorSurgeries 0.000000
smoker 8976.649018
State_ID_R1011 0.000000
State_ID_R1012 0.000000
State_ID_R1013 -263.622647
Age 3407.989315
Gender 0.000000

Use random forest and extreme gradient boosting for cost prediction, share your crossvalidation results, and calculate the variable importance scores

Random Forest Algorithm¶

In [105]:
from sklearn.ensemble import RandomForestRegressor
In [106]:
rf = RandomForestRegressor(n_estimators = 1000, random_state = 42)
In [107]:
rf.fit(x_train, y_train)
Out[107]:
RandomForestRegressor(n_estimators=1000, random_state=42)
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
RandomForestRegressor(n_estimators=1000, random_state=42)
In [108]:
score = rf.score(x_test,y_test)
score
Out[108]:
0.9016674828639792
In [109]:
y_pred = rf.predict(x_test)
In [110]:
rf_mae = mean_absolute_error(y_test, y_pred)
In [111]:
rf_mae
Out[111]:
1942.660224473113

Extreme gradient boosting¶

In [112]:
from sklearn.ensemble import GradientBoostingRegressor
In [113]:
gbr = GradientBoostingRegressor(n_estimators = 1000, random_state = 42)
In [114]:
gbr.fit(x_train, y_train)
Out[114]:
GradientBoostingRegressor(n_estimators=1000, random_state=42)
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
GradientBoostingRegressor(n_estimators=1000, random_state=42)
In [115]:
score = gbr.score(x_test,y_test)
score
Out[115]:
0.8654331231806346
In [116]:
y_pred = gbr.predict(x_test)
In [117]:
gbr_mae = mean_absolute_error(y_test, y_pred)
gbr_mae
Out[117]:
2439.1036419593333

1. Case scenario: Estimate the cost of hospitalization for Christopher, Ms. Jayna (her date of birth is 12/28/1988, height is 170 cm, and weight is 85 kgs). She lives in a tier-1 city and her state’s State ID is R1011. She lives with her partner and two children. She was found to be nondiabetic (HbA1c = 5.8). She smokes but is otherwise healthy. She has had no transplants or major surgeries. Her father died of lung cancer. Hospitalization costs will be estimated using tier-1 hospitals.¶

In [118]:
date = str(19881228)
date1 = pd.to_datetime(date, format = "%Y%m%d")
In [119]:
current_date = dt.datetime.now()
current_date
Out[119]:
datetime.datetime(2024, 2, 10, 13, 41, 7, 945468)
In [120]:
age = (current_date - date1)
age
Out[120]:
Timedelta('12827 days 13:41:07.945468')
In [121]:
age = int(12421/365)
age
Out[121]:
34

So the age of Christopher, Ms. Jayna is 34

In [122]:
height_m = 170/100
height_sq = height_m*height_m
BMI = 85/height_sq
np.round(BMI,2)
Out[122]:
29.41

BMI is 29.41

In [123]:
df.columns
Out[123]:
Index(['children', 'charges', 'Hospital tier', 'City tier', 'BMI', 'HBA1C',
       'Heart Issues', 'Any Transplants', 'Cancer history',
       'NumberOfMajorSurgeries', 'smoker', 'State_ID_R1011', 'State_ID_R1012',
       'State_ID_R1013', 'Age', 'Gender'],
      dtype='object')
In [124]:
df.drop(['charges'],inplace=True,axis=1)
In [125]:
list = [[2,1,1,29.41,5.8,0,0,0,0,1,1,0,0,34,0]]
In [126]:
df = pd.DataFrame(list, columns = ['children','Hospital tier', 'City tier', 'BMI', 'HBA1C','Heart Issues', 'Any Transplants', 
                              'Cancer history','NumberOfMajorSurgeries', 'smoker', 'State_ID_R1011', 'State_ID_R1012',
                              'State_ID_R1013', 'age', 'gender'] )
df
Out[126]:
children Hospital tier City tier BMI HBA1C Heart Issues Any Transplants Cancer history NumberOfMajorSurgeries smoker State_ID_R1011 State_ID_R1012 State_ID_R1013 age gender
0 2 1 1 29.41 5.8 0 0 0 0 1 1 0 0 34 0

Find the predicted hospitalization cost using all five models. The predicted value should be the mean of the five models' predicted values'.¶

In [127]:
Hospitalization_cost = []
In [128]:
# Now lets predict the hospitalization cost through SGDRegressor
Cost1 = sgd.predict(df)
Hospitalization_cost.append(Cost1)
In [129]:
# Now lets predict the hospitalization cost through Random Forest
Cost2 = rf.predict(df)
Hospitalization_cost.append(Cost2)
In [130]:
# Now lets predict the hospitalization cost throug Extreme gradient Booster
Cost3 = gbr.predict(df)
Hospitalization_cost.append(Cost3)
In [131]:
Hospitalization_cost
Out[131]:
[array([217142.19936608]), array([45115.32171]), array([49915.6762532])]
In [132]:
avg_cost = np.mean(Hospitalization_cost)
avg_cost
Out[132]:
104057.73244309246

The average predicted hospitalization cost is 104814.30